Density and Non-Grid based Subspace Clustering via Kernel Density Estimation
Abstract
Rather than searching for clusters in the full data space, subspace clustering, an emerging task, aims to discover clusters embedded in subspaces. Most previous work in the literature takes a grid- and density-based approach, in which the performance of the clustering algorithm depends heavily on how the grid is partitioned, and a poor partition can seriously distort the real data distribution. To address this grid-partition dilemma, we propose a Density and Non-Grid based Subspace Clustering (DNGSC) algorithm via Kernel Density Estimation, which is able to discover arbitrarily shaped subspace clusters and is insensitive to its only input parameter. Kernel Density Estimation, with its firm mathematical basis, can accurately uncover the underlying boundaries of the data without presuming any canonical distribution. In addition, to deal with the "density divergence problem", in which region densities vary across subspace cardinalities, different density thresholds are automatically determined for different subspaces so that arbitrarily shaped clusters can be discovered effectively in all subspaces, and an FP-tree structure is employed to mine all dense subspace clusters with two efficient pruning strategies. To demonstrate the effectiveness and efficiency of DNGSC, we conduct extensive experiments on both synthetic and real data sets; comparisons of accuracy and efficiency with DENCOS, MAFIA and CLIQUE show the superiority of the new algorithm.
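The idea of estimating density without a grid and thresholding it per subspace can be illustrated with a minimal sketch. This is not the DNGSC algorithm itself: the function names, the single shared bandwidth, and the rule "threshold = factor × mean density in the subspace" are assumptions made only for illustration, and the FP-tree mining and pruning steps are not shown.

```python
import numpy as np

def gaussian_kde_density(data, queries, bandwidth):
    """Gaussian product-kernel density estimate at each query point.

    data: (n, d) sample, queries: (m, d) evaluation points.
    bandwidth: one scalar h shared by all dimensions (an assumption).
    """
    data = np.asarray(data, dtype=float)
    queries = np.asarray(queries, dtype=float)
    n, d = data.shape
    # Pairwise scaled differences, shape (m, n, d).
    diff = (queries[:, None, :] - data[None, :, :]) / bandwidth
    sq_dist = np.sum(diff ** 2, axis=2)
    # Normalising constant of a d-dimensional Gaussian kernel.
    norm = (2.0 * np.pi) ** (d / 2.0) * bandwidth ** d
    return np.exp(-0.5 * sq_dist).sum(axis=1) / (n * norm)

def dense_points(data, dims, bandwidth, factor):
    """Flag points whose density in the subspace `dims` exceeds a threshold.

    The threshold is recomputed for every subspace as
    factor * (mean density in that subspace). This rule is hypothetical;
    it only mimics the idea of a subspace-specific threshold.
    """
    sub = np.asarray(data, dtype=float)[:, dims]
    density = gaussian_kde_density(sub, sub, bandwidth)
    return density >= factor * density.mean(), density

# Synthetic data: a tight cluster plus uniform background noise.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.5, scale=0.02, size=(100, 3))
noise = rng.uniform(0.0, 1.0, size=(100, 3))
data = np.vstack([cluster, noise])

# The raw density scale differs between a 1-D and a 2-D subspace,
# which is why a single global threshold would not transfer.
mask_1d, dens_1d = dense_points(data, [0], bandwidth=0.05, factor=1.5)
mask_2d, dens_2d = dense_points(data, [0, 1], bandwidth=0.05, factor=1.5)
print("max density, 1-D subspace:", dens_1d.max())
print("max density, 2-D subspace:", dens_2d.max())
```

Because the threshold is derived from the densities observed in each subspace, the same `factor` can be reused across subspaces of different cardinality, whereas a fixed absolute threshold would have to be retuned for each one.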
Related Resources
Gaussian Mixture Models with Component Means Constrained in Pre-selected Subspaces
We investigate a Gaussian mixture model (GMM) with component means constrained in a pre-selected subspace. Applications to classification and clustering are explored. An EM-type estimation algorithm is derived. We prove that the subspace containing the component means of a GMM with a common covariance matrix also contains the modes of the density and the class means. This motivates us to find a...
Comparison of the Gamma kernel and the orthogonal series methods of density estimation
The standard kernel density estimator suffers from a boundary bias issue for probability density function of distributions on the positive real line. The Gamma kernel estimators and orthogonal series estimators are two alternatives which are free of boundary bias. In this paper, a simulation study is conducted to compare small-sample performance of the Gamma kernel estimators and the orthog...
Kernel Density Estimation for Text-Based Geolocation
Text-based geolocation classifiers often operate with a grid-based view of the world. Predicting document location of origin based on text content on a geodesic grid is computationally attractive since many standard methods for supervised document classification carry over unchanged to geolocation in the form of predicting a most probable grid cell for a document. However, the grid-based approa...
Projection Pursuit via Decomposition of Bias Terms of Kernel Density
Dimension reduction of data, R^d → R^p (p << d), to be used for clustering has specific requirements that are not generally met by generic dimension reduction algorithms such as principal components. Projection pursuit, on the other hand, has a growing variety of criteria that target holes, skewness, etc., using information measures, density functionals, sample moments, etc. With the exception o...
Breast Cancer Diagnosis Using Nonparametric Probability Density Estimation Based on Kernel Methods
Introduction: Breast cancer is the most common cancer in women. An accurate and reliable system for the early diagnosis of benign or malignant tumors is needed. Using the results of FNA together with data mining and machine learning techniques, new methods can be designed for early diagnosis that detect breast cancer with high accuracy. Materials and Methods: In this study,...